python [lxml] - cleaning out html tags
        Posted  
        
            by sadhu_
        on Stack Overflow
        
        See other posts from Stack Overflow
        
            or by sadhu_
        
        
        
        Published on 2010-06-01T13:28:34Z
        Indexed on 
            2010/06/01
            13:43 UTC
        
        
        Read the original article
        Hit count: 220
        
from lxml.html.clean import clean_html, Cleaner
    def clean(text):
        try:        
            cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True,
                      remove_tags = ['a', 'li', 'td'])
            print (len(cleaner.clean_html(text))- len(text))
            return cleaner.clean_html(text) 
        except:
            print 'Error in clean_html'
            print sys.exc_info()
            return text
I put together the above (ugly) code as my initial forays into python land. I'm trying to use lxml cleaner to clean out a couple of html pages, so in the end i am just left with the text and nothing else - but try as i might, the above doesnt appear to work as such, i'm still left with a substial amount of markup (and it doesnt appear to be broken html), and particularly links, which aren't getting removed, despite the args i use in remove_tags and links=True
any idea whats going on, perhaps im barking up the wrong tree with lxml ? i thought this was the way to go with html parsing in python?
© Stack Overflow or respective owner